Abstract

This paper introduces lattice based language models, a new language modeling paradigm. These models construct multi-dimensional hierarchies of partitions and select the most promising partitions to generate the estimated distributions. We discuss a specific two-dimensional lattice and propose two primary features to measure the usefulness of each node: the training-set history count and the smoothed entropy of its prediction. Smoothing techniques are reviewed and a generalization of the conventional backoff strategy to multiple dimensions is proposed. Preliminary experimental results obtained on the SWITCHBOARD corpus show a 6.5% perplexity reduction over a word trigram model.
1  Introduction


Statistical language modeling is concerned with estimating the probability of various linguistic events, using large samples of language data. As used in automatic speech recognition, statistical language models typically estimate $\Pr(w \mid h)$, the conditional distribution of the identity of the next word in a sentence or document, given the current history (namely the identity of the words that occurred up to this point).

The  most  common  statistical  language  model  is  the  N-gram,  which  makes 
the  simplifying  assumption: 

$$\Pr(w_i \mid h) = \Pr(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx \Pr(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$$

N-gram  models  have  dominated  statistical  language  modeling  ever  since 
their  introduction  in  the  1970’s  [1].  In  spite  of  their  apparent  limitations, 
N-gram  models  proved  simple,  robust,  and  surprisingly  hard  to  improve  on 
([2]). 

Within the N-gram paradigm, much work was done on smoothing, word clustering and adaptation. In smoothing, the dominant ideas are those of discounting ([3, 4]), and backing off to ([5]), or linear interpolation with ([1]), lower order models. In clustering words, most algorithms use iterative methods that greedily attempt to minimize local information theoretic measures ([6, 7]). N-gram based adaptation work consists primarily of variations on interpolating the static model with small N-grams built from more pertinent data (or from the document's history, i.e. a cache) ([8, 9, 10, 11]).

During  the  last  decade,  several  attempts  were  made  to  break  away  from 
the  N-gram  paradigm.  These  include  decision  trees  and  maximum  entropy 
models. 

Decision tree language models ([12]) are the ultimate in partition based modeling, because they can implement arbitrary partitions. But this richness is also the source of their main weakness, which is the computational intractability of finding the optimal tree. This leads to greedy searches and other algorithmic and modeling compromises, which affect the quality of the resulting model. As a consequence, decision trees have not yet succeeded in significantly improving on the baseline N-gram model.

Another problem with decision trees is data fragmentation. Once a tree has been constructed, each history fits into exactly one leaf, and the resulting estimation is based only on the training data that belong on the path from the root to that leaf. No use is made of training data which may be intimately related to the current situation but which happens to diverge early on into other paths due to orthogonal questions higher up in the tree.

The ability to combine evidence based on diverse and partially overlapping features was the main motivation behind the introduction of the maximum entropy paradigm to language modeling ([13, 14, 15, 16, 17]). As was demonstrated in ([15]), maximum entropy models can successfully integrate diverse knowledge sources in a unified and consistent statistical framework, and can result in significant improvement over the existing state-of-the-art N-gram based techniques. However, as was also discussed in ([15]), training maximum entropy models is computationally very demanding, which renders them of little use when large amounts of data (e.g. a hundred million words) are available.

This  report  discusses  lattice  based  language  models  —  an  alternative 
language  modeling  paradigm  which  we  have  just  started  exploring.  Like  a 
decision  tree,  a  lattice  is  based  on  a  set  of  partitions  of  the  history,  and  like  an 
N-gram,  the  set  of  partitions  is  strongly  constrained  by  word  order  and  word 
ordinal  position.  But  unlike  a  decision  tree,  estimates  are  constructed  using 
multiple  partitions  which  may  or  may  not  be  refinements  of  each  other.  This 
allows  multiple,  partially  overlapping  knowledge  sources  to  be  incorporated, 
as  in  a  maximum  entropy  model.  But  unlike  the  latter,  training  a  lattice 
based  model  is  not  computationally  demanding. 


2  Outline 

Classical N-gram models define a particular partitioning of the history space. For example, a 3-gram model is defined as $P(w_i \mid h) = P(w_i \mid w_{i-2} w_{i-1})$, where all histories $h$ sharing the same last two words are considered to be equivalent. Another way of partitioning the history space relies on word classes. For example, a class 3-gram model^1 can be defined as $P(w_i \mid g_{i-2} g_{i-1})$, where the history is seen as a sequence of classes $(g_{i-2} g_{i-1})$. Similarly, this class history can be made coarser by clustering the original classes into superclasses. Thus, in addition to the model order, the definition of a hierarchy of classes allows for a natural extension to N-gram models in which the space of histories can be partitioned. This idea is described in section 3, where lattices of N-gram models are introduced. Analysis of this approach as used on the Switchboard data is given in section 4. The definition of lattices of N-gram models also suggests an extension to backoff smoothing, called two-dimensional backoff, which is detailed in section 5.

^1 Classes are introduced here only for defining various partitions of the history space. The resulting model, i.e. $p(w_i \mid g_{i-2} g_{i-1})$, contrasts with a more traditional class 3-gram where the prediction is made in two steps: $p(w_i \mid g_i)\, p(g_i \mid g_{i-2} g_{i-1})$.

Linear combination of predictors is static in the sense that the interpolation weights remain fixed after their estimation. A backoff model can also be considered static since the most specific predictor is always used when available^2. Here we investigate language models that dynamically choose among a large set of predictors. In other words, the combination of various predictors depends on their estimated quality in a given context. These ideas are developed in section 6.


3  Lattice  of  N-gram  models 

3.1  Model  definition 

A lattice of N-gram models is shown in figure 1. In this particular example, 17 predictors are considered. They correspond to five model orders (from 5-gram on the left to unigram on the right) and a hierarchy of four class levels (the word level at the bottom and the coarsest class level at the top). The different class levels collapse to one predictor in the unigram case, in which all histories are mapped to one equivalence class.

The lattice structure represents a set of inclusion relations. In particular, the hierarchy of classes defines the following inclusions in the space of histories:

$$\{(w_{i-1})\} \subset \{(g_{i-1})\} \subset \{(G_{i-1})\} \subset \ldots$$

or similarly

$$\{(w_{i-2} w_{i-1})\} \subset \{(g_{i-2} g_{i-1})\} \subset \{(G_{i-2} G_{i-1})\} \subset \ldots$$

The order of the model also defines the following inclusions in the space of histories

$$\{(w_{i-4} w_{i-3} w_{i-2} w_{i-1})\} \subseteq \{(w_{i-3} w_{i-2} w_{i-1})\} \subseteq \ldots$$

or similarly

$$\{(g_{i-4} g_{i-3} g_{i-2} g_{i-1})\} \subseteq \{(g_{i-3} g_{i-2} g_{i-1})\} \subseteq \ldots$$

^2 Even backoff models with cutoffs [18] are static as long as the cutoff values are fixed for all histories.

Figure 1: Lattice of language models

Each node of the lattice represents a particular predictor which, for a given history $h$ and a given predicted word $w$, is associated with two counts $C(h,w)$ and $C(h)$. A maximum likelihood estimate^3 of the probability $P(w \mid h)$ is given by the ratio of these counts: $p(w \mid h) = C(h,w) / C(h)$. Thus, apart from the important issue of smoothing, the probability estimate is completely defined with two counts.
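As an illustration of how each lattice node reduces to two counts, the following minimal sketch collects $C(h)$ and $C(h,w)$ for every (model order, class level) node of a toy lattice. The class map, the function names and the toy sentences are all hypothetical; the unigram is kept at a single level because the class levels collapse there.

```python
from collections import Counter

# Hypothetical toy class maps: level 0 is the word itself, level 1 is a coarser class.
CLASS_MAPS = [
    lambda w: w,
    {"the": "DET", "a": "DET", "cat": "N", "dog": "N", "sat": "V"}.get,
]

def node_history(words, order, level):
    """Map the last (order - 1) words of the context to the given class level."""
    context = words[-(order - 1):] if order > 1 else []
    return tuple(CLASS_MAPS[level](w) for w in context)

def collect_counts(sentences, max_order, n_levels):
    """Return the history counts C(h) and joint counts C(h, w) of all nodes."""
    hist, joint = Counter(), Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for order in range(1, max_order + 1):
                if i < order - 1:
                    continue  # not enough context for this order
                # All class levels collapse into one predictor for the unigram.
                levels = range(n_levels) if order > 1 else [0]
                for level in levels:
                    h = node_history(sent[:i], order, level)
                    hist[(order, level, h)] += 1
                    joint[(order, level, h, w)] += 1
    return hist, joint

def mle(hist, joint, order, level, h, w):
    """Maximum likelihood estimate p(w | h) = C(h, w) / C(h) at one node."""
    c_h = hist[(order, level, h)]
    return joint[(order, level, h, w)] / c_h if c_h else 0.0
```

In this sketch a word-level trigram history such as ("the", "cat") and its class-level counterpart ("DET", "N") are simply two keys into the same count tables, which is all the lattice structure requires before smoothing.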

The lattice structure reflects the partial ordering between predictors together with the specificity of each. In our example, the 5-gram at the word level is the most specific predictor while the unigram is the least specific one. In other words, the lattice can also be seen as a DAG in which any path goes from a more specific to a less specific predictor. The traditional backoff model becomes a particular case in which the lattice is reduced to one dimension (associated with the model order) and backing off consists of moving towards a less specific predictor.

While the unigram is the least specific model, it is also the most reliable, as it is estimated with the largest amount of data. Our ultimate goal is to find the best combination of predictors that trades specificity off against reliability. More qualitative measures related to these questions are defined in section 3.3.

Possible extensions to this model include more flexible (and numerous) history partitions, i.e. where every position in the history can come from a different class level. Other dimensions can also be considered since each predictor can be estimated on a hierarchy of corpora or on a topic tree.

^3 In this paper an uppercase $P$ denotes a true probability and a lowercase $p$ denotes a probability estimate.


3.2  Independent  predictors 


Figure  2:  Lattice  of  history  counts 

The estimates represented in the lattice nodes as described so far are dependent predictors. For instance, the set of 3-gram histories represented in the lattice is a subset of the 2-gram histories. Under some circumstances, it is conceivable that we would want to make statistical measures on lattice nodes as if they were not dependent on each other. For instance, testing the hypothesis that adjacent nodes' distributions are identical would demand that comparison data be drawn from independent distributions. For this reason we developed the following method for construction of independent estimators from the dependent estimators. While this construction is not used in any of the following experiments, we have included its derivation here for the sake of completeness.

Each predictor in the lattice is characterized, for a given history $h$ and a given predicted word $w$, by two counts $C(h,w)$ and $C(h)$. Let us first consider the history counts at each node as represented in figure 2. $N$ denotes the training set size, i.e. the unigram "history" count, and $C(w_{i-2} w_{i-1})$ denotes the history count for a particular word 3-gram. For simplicity, we assume here that the word trigram is the most specific predictor^4.

Along the horizontal dimension we can build independent predictors in the following way. The word bigram clearly depends on the word trigram since $\{(w_{i-2} w_{i-1})\} \subset \{(w_{i-1})\}$, but one can compute an independent "bigram" by a difference of two sets. A corrected count $\tilde{C}(w_{i-1})$ is obtained by subtracting the count $C(w_{i-2} w_{i-1})$ from the original bigram history count $C(w_{i-1})$:

$$\tilde{C}(w_{i-1}) = C(w_{i-1}) - C(w_{i-2} w_{i-1}) \qquad (1)$$

The corrected count can be written more accurately as $C(\bar{w}_{i-2} w_{i-1})$, where $\bar{w}_{i-2}$ denotes any word except $w_{i-2}$.

This subtraction operation can be applied to other counts in the same way. For instance, the corrected unigram count becomes $\tilde{N}$:

$$\tilde{N} = N - C(w_{i-1}) \qquad (2)$$

In both cases, counts of independent events can be obtained by subtracting adjacent counts in the lattice. The same result applies to the vertical dimension since

$$\tilde{C}(g_{i-2} g_{i-1}) = C(g_{i-2} g_{i-1}) - C(w_{i-2} w_{i-1}) \qquad (3)$$

and

$$\begin{aligned}
\tilde{C}(G_{i-2} G_{i-1}) &= C(G_{i-2} G_{i-1}) - C(w_{i-2} w_{i-1}) - C(g_{i-2} g_{i-1}) + C\{(w_{i-2} w_{i-1}) \cap (g_{i-2} g_{i-1})\} \\
&= C(G_{i-2} G_{i-1}) - C(g_{i-2} g_{i-1})
\end{aligned} \qquad (4)$$

A similar computation can be performed for predictors which have more than one predecessor in the lattice. Following the same reasoning as before, a corrected count $\tilde{C}(g_{i-1})$ can be obtained from the original bigram history count $C(g_{i-1})$ by subtracting the two adjacent counts, $C(w_{i-1})$ and $C(g_{i-2} g_{i-1})$, and by adding the count of their intersection $C\{(w_{i-1}) \cap (g_{i-2} g_{i-1})\}$:

$$\begin{aligned}
\tilde{C}(g_{i-1}) &= C(g_{i-1}) - C(w_{i-1}) - C(g_{i-2} g_{i-1}) + C\{(w_{i-1}) \cap (g_{i-2} g_{i-1})\} \\
&= C(g_{i-1}) - C(w_{i-1}) - C(g_{i-2} g_{i-1}) + C(g_{i-2} w_{i-1})
\end{aligned} \qquad (5)$$

Here an additional count, i.e. $C(g_{i-2} w_{i-1})$, must be collected for each estimate^5.

Finally, the same reasoning may be followed to produce corrected joint counts $\tilde{C}(h,w)$, where the subtraction operations are parallel to the ones described above. The partitioning of the history space corresponds now to mutually exclusive instead of inclusive sets.

^4 The proposed calculation can be extended trivially to lattices of any order.
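The inclusion-exclusion step for a node with two predecessors (the corrected class bigram count) can be checked numerically. The sketch below is only an illustration under assumed inputs: `positions` is a hypothetical list holding, for each predicted position, the two preceding words, and `cls` is a hypothetical word-to-class map.

```python
def corrected_class_bigram_count(positions, w2, w1, cls):
    """Corrected count for the class bigram history g(w1).

    positions: list of (second-previous word, previous word) pairs,
               one per predicted position in the training data.
    The result counts positions whose previous word is in class g(w1)
    but which belong neither to the word bigram history (w1) nor to
    the class trigram history (g(w2) g(w1)).
    """
    g2, g1 = cls[w2], cls[w1]
    c_g = sum(1 for a, b in positions if cls[b] == g1)                     # C(g1)
    c_w = sum(1 for a, b in positions if b == w1)                          # C(w1)
    c_gg = sum(1 for a, b in positions if cls[a] == g2 and cls[b] == g1)   # C(g2 g1)
    c_gw = sum(1 for a, b in positions if cls[a] == g2 and b == w1)        # C(g2 w1)
    return c_g - c_w - c_gg + c_gw
```

The four terms correspond one by one to the original class bigram count, the two adjacent counts that are subtracted, and the intersection count that is added back.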

^5 The unigram predictor also has more than one predecessor in the lattice, but in this case the final corrected count is given by $\tilde{N} = N - C(G_{i-1})$.
3.3  Entropy of smoothed distribution

As mentioned in section 3, we are interested in measuring both the reliability and specificity of each of the predictors in the lattice. We can assume that high count histories will be reliably estimated. Consequently we will use the history count $C(h)$ itself as our reliability measure. To capture the notion of specificity we would have to consider the distance between the distribution associated with a predictor and the true, but unknown, distribution $P(w \mid h)$. A related notion is the usefulness of the prediction, which can be measured as the entropy of the history $H(h)$, that is, the entropy of the estimated distribution $p(w \mid h)$ used to predict any word $w$ from a fixed history $h$:

$$H(h) = -\sum_{w} p(w \mid h) \log p(w \mid h) \qquad (6)$$

In general, the estimate $p(w \mid h)$ must be smoothed since for many predictors there is so little data that the entropy estimate is highly unreliable. However, smoothing is usually performed by combining several predictors. In the proposed framework this would correspond to considering the same sequence of tokens as members of different history partition classes. Such a smoothing mechanism would no longer allow for measuring the entropy of a particular history, but rather of combined histories. A solution to this problem consists in smoothing by absolute discounting and backing off to the unigram distribution $p(w)$ as described in equation 7.

$$p(w \mid h) = \begin{cases} \dfrac{C(h,w) - d}{C(h)} & \text{if } C(h,w) > 0 \\[2ex] \alpha(h)\, p(w) & \text{otherwise} \end{cases} \qquad (7)$$

where $d$ denotes the discounting value (typically 0.5) and $\alpha(h)$ is a normalizing factor.

This simple smoothing technique combines only the original predictor with the unigram, in such a way that comparison with other predictors is meaningful. A high entropy value indicates a flat distribution for this particular history while a low entropy indicates a sharp distribution.

Another advantage of the proposed smoothing is the low cost of the entropy calculation^6. Let $H_1$ denote the entropy of the unigram distribution, i.e. $H_1 = -\sum_{w} p(w) \log p(w)$, which needs to be computed only once. The entropy calculation can then be rewritten as in equation 8, in which all summations are over the set of words $w$ such that $C(h,w) > 0$. This set of words generally represents a very small fraction of the vocabulary.

^6 The entropy calculation for the independent estimators described in section 3.2 would be much more costly than that described here. This is the main reason that the independent estimator formulation was not used in the experiments described in this paper.


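$$\begin{aligned}
H(h) &= -\sum_{w:C(h,w)>0} p(w \mid h) \log p(w \mid h) \;-\; \sum_{w:C(h,w)=0} \alpha(h)\, p(w) \log\big(\alpha(h)\, p(w)\big) \\
&= -\sum_{w:C(h,w)>0} p(w \mid h) \log p(w \mid h) \;+\; \alpha(h) \Big[ H_1 + \sum_{w:C(h,w)>0} p(w) \log p(w) + \log \alpha(h) \Big( \sum_{w:C(h,w)>0} p(w) - 1 \Big) \Big]
\end{aligned} \qquad (8)$$

where

$$\alpha(h) = \frac{1 - \sum_{w:C(h,w)>0} p(w \mid h)}{1 - \sum_{w:C(h,w)>0} p(w)}$$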
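The rewritten entropy can be sketched as follows. This is a minimal illustration assuming absolute discounting with a constant $d$ and a unigram table covering the whole (closed) vocabulary; the function name and argument layout are hypothetical, and the sketch assumes at least one vocabulary word is unseen after $h$ so that the denominator of $\alpha(h)$ is non-zero.

```python
import math

def smoothed_entropy(joint_counts, c_h, unigram, h1, d=0.5):
    """Entropy H(h) of the discounted estimate backed off to the unigram.

    joint_counts: dict word -> C(h, w), only for words seen after h
    c_h:          history count C(h)
    unigram:      dict word -> p(w) over the whole vocabulary
    h1:           unigram entropy H_1, computed once beforehand
    Only the words seen after h are visited, not the whole vocabulary.
    """
    seen_p = {w: (c - d) / c_h for w, c in joint_counts.items()}
    seen_mass = sum(seen_p.values())                  # sum of p(w|h) over seen words
    seen_uni = sum(unigram[w] for w in joint_counts)  # sum of p(w) over seen words
    alpha = (1.0 - seen_mass) / (1.0 - seen_uni)      # normalizing factor alpha(h)
    h_seen = -sum(p * math.log(p) for p in seen_p.values())
    uni_plogp = sum(unigram[w] * math.log(unigram[w]) for w in joint_counts)
    return h_seen + alpha * (h1 + uni_plogp + math.log(alpha) * (seen_uni - 1.0))
```

The cost per history is proportional to the number of distinct words seen after it, which is the saving the rewritten formula is meant to deliver.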


3.4  Hierarchical  clustering 

The definition of a lattice of N-gram models relies on the development of word classes which deterministically map a word $w$ to a class $g(w)$. This mapping can be automatically constructed by a clustering algorithm such as the one proposed by Kneser and Ney [19]. Its objective is to find a mapping such that an associated class bigram model^7 has a locally minimal perplexity on the training data. This criterion can be shown to be equivalent to the local minimization of the loss of mutual information between words.

Ney's algorithm does not construct a hierarchical clustering since the number of classes is fixed a priori^8. However, a hierarchy of classes can be obtained by relabeling the training data according to the estimated word-to-class mapping and by iterating the clustering with a smaller number of classes. Figure 3 shows a typical example of hierarchical clusters constructed from the Switchboard data, where the number of classes used was 1600, 300 and 50, respectively.

^7 For a class bigram model the probability of the next word is given by $p(w_i \mid g(w_i))\, p(g(w_i) \mid g(w_{i-1}))$.

^8 Notice that the optimal number of classes for a single class bigram model can be estimated on a held-out set or by leaving-one-out.
Figure 3: Hierarchical clustering example (words shown in the example clusters: when, whenever, maybe, consequently, perhaps, unless, whether, if, what, how, where, why, wherever, hopefully, desperately, shall, therefore, demanding, afterward, bounce, thus, decreasing)

4  Data  Analysis 

4.1  Switchboard  data 

The Switchboard data used in the present work consists of about 2.5 million words of transcribed conversational speech [20]. We chose a vocabulary of 9802 words corresponding to a coverage of 98.5%. This vocabulary is closed as it contains the special token UNK to which any out of vocabulary word is mapped.

For the current experiments the data were randomly split into 3 sets. The first set forms the training data from which the counts are computed. The second set is a held-out set used for analysis and additional parameter estimation. Finally the test set is used for evaluating the perplexity of the proposed models. Table 1 summarizes the number of sentences and word tokens in these data sets. A hierarchy of classes was also built from the training data with respectively 1600, 300 and 50 classes as described in section 3.4.


Dataset      # sentences    # words
Training     140,807        2,365,741
Held-out     9,000          151,346
Test         2,568          39,956

Table 1: Switchboard data


4.2  History  and  prediction  hit  ratios 

Figure 4: History and prediction hit ratios

For each lattice node and for each word to be predicted in the held-out set, the two counts $C(h)$ and $C(h,w)$ can be computed on the training data. The history hit ratio is the fraction of the time a particular history, observed in the held-out set, was already seen in the training set (i.e. $C(h) > 0$). The prediction hit ratio is the fraction of the time the next word was already seen after that particular history in the training set (i.e. $C(h,w) > 0$).
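Both ratios are simple to compute for one lattice node once the training counts are available. The following is a minimal sketch; the function name and the dictionary layout of the counts are assumptions made for illustration.

```python
def hit_ratios(events, hist_counts, joint_counts):
    """Return (history hit ratio, prediction hit ratio) for one lattice node.

    events:       list of (h, w) pairs observed in the held-out set
    hist_counts:  dict h -> C(h) computed on the training data
    joint_counts: dict (h, w) -> C(h, w) computed on the training data
    """
    n = len(events)
    history_hits = sum(1 for h, _ in events if hist_counts.get(h, 0) > 0)
    prediction_hits = sum(1 for h, w in events if joint_counts.get((h, w), 0) > 0)
    return history_hits / n, prediction_hits / n
```

Since a prediction hit implies a history hit, the second ratio can never exceed the first for the same node.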

Figure 4 gives the history and prediction hit ratios on the held-out set. For example, at the 1,600 class level (C1600) the 4-gram histories $h$ were already observed 66% of the time while the joint events $(h,w)$ were already observed 30% of the time^9.

In  a  typical  backoff  model  a  prediction  miss  occurs  when  a  backoff  to  a 
lower  order  is  required.  Given  that  our  reference  model  is  a  word  3-gram, 
it  is  interesting  to  see  which  of  the  other  predictors  might  be  used  instead. 
Figure  5  gives  the  prediction  hit  ratios  of  all  predictors  when  the  word  3- 
gram  model  would  back  off  to  lower  order  predictors.  On  this  data  set,  the 
word  2-grams  could  be  used  in  77  %  of  these  cases  and  a  4-gram  model  with 
50  classes  could  also  be  used  43  %  of  the  time.  This  indicates  that  other 
predictors  in  the  lattice  could  overcome  a  prediction  miss  of  the  reference 
model. 


^9 Notice that the 4-gram history hit ratio is not necessarily equal to the 3-gram prediction hit ratio. This is due to sentence boundary effects.
Figure 5: Prediction hit ratio when word trigram backs off

The analysis of the prediction hit ratio for each predictor is of little use in practice, as the next word must be known in advance in order to determine whether a particular predictor would hit or miss it. However, the count of the history is known before the prediction occurs. Figure 6 presents the relation between the count of the history $C(h)$ of any of the 16 predictors^10 in the lattice and the prediction hit. The data was gathered by pooling together datapoints from all 16 predictors applied to the same count. As may be expected, the prediction hit ratio increases rapidly as the count of the history increases^11.

^10 The unigram model, absent from figure 6, has a 100% prediction hit ratio since the vocabulary is closed.

^11 Notice the logarithmic scale along the horizontal axis of figure 6.

Figure 6: Prediction hit ratio as a function of history count


The previous analysis can be refined as follows. In figure 7, the prediction hit ratio is plotted as a function of the history count $C(h)$ and the entropy $H(h)$. In particular, over all predictors such that $C(h)$ falls between 100 and 10,000 the prediction hit will be relatively higher if the entropy is smaller. In other words, among all predictors falling in this count interval, one should prefer the most specific ones, that is, the ones with the lowest entropy. It is interesting to note that 90% of the counts actually fall in this interval, as can be concluded from the histogram of counts presented in figure 8.
Figure 7: Prediction hit ratio as a function of history count and entropy

5  Smoothing  techniques 

Traditional backoff models combine several predictors to overcome the ever present data sparseness problem. We review in section 5.1 the details of backoff models and we show in section 5.2 how these techniques can be extended to lattices of N-gram models.


5.1  Backoff  scheme 

Equation 9 describes a particular backoff scheme where $d_c$ denotes the discounted value subtracted from the counts of seen events. This discounted value may depend on the count $C(h,w)$ as in Turing-Good discounting [21] or may be constant as in the case of absolute discounting [22]. The normalized discounted probability mass is distributed to unseen events in proportion to their backoff estimates ($p_{back}(w \mid h)$). Here the backoff distribution $p_{back}(w \mid h)$ is only used if the higher order estimate cannot be used, that is when $C(h,w) = 0$. We will refer to this particular backoff scheme as shadowing since the higher order estimate shadows the backoff distribution.

$$p(w \mid h) = \begin{cases} \dfrac{C(h,w) - d_c}{C(h)} & \text{if } C(h,w) > 0 \\[2ex] \alpha(h)\, p_{back}(w \mid h) & \text{otherwise} \end{cases} \qquad (9)$$

Figure 8: Histogram of history counts

Equation 10 describes a different backoff scheme in which the backoff distribution is used in all cases and the normalization factor, here $\gamma(h)$, is defined accordingly. We will refer to this particular backoff scheme as non-shadowing^12.

$$p(w \mid h) = \begin{cases} \dfrac{C(h,w) - d_c}{C(h)} + \gamma(h)\, p_{back}(w \mid h) & \text{if } C(h,w) > 0 \\[2ex] \gamma(h)\, p_{back}(w \mid h) & \text{otherwise} \end{cases} \qquad (10)$$
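The two schemes can be contrasted in a few lines of code. The sketch below assumes absolute discounting with a constant $d$, a history count equal to the sum of the joint counts, and a backoff distribution given as a plain dictionary over a closed vocabulary; the function names are hypothetical.

```python
def shadowing(w, joint, c_h, p_back, d=0.5):
    """Shadowing backoff: p_back is used only when C(h, w) = 0."""
    if joint.get(w, 0) > 0:
        return (joint[w] - d) / c_h
    freed = d * len(joint) / c_h                         # discounted probability mass
    unseen_back = 1.0 - sum(p_back[v] for v in joint)    # backoff mass of unseen words
    alpha = freed / unseen_back                          # normalization factor alpha(h)
    return alpha * p_back[w]

def non_shadowing(w, joint, c_h, p_back, d=0.5):
    """Non-shadowing backoff: p_back contributes to every word."""
    gamma = d * len(joint) / c_h                         # normalization factor gamma(h)
    discounted = max(joint.get(w, 0) - d, 0.0) / c_h     # zero for unseen words
    return discounted + gamma * p_back[w]
```

In the non-shadowing case the normalization factor is just the freed mass, because the backoff distribution is spread over the whole vocabulary rather than over the unseen words only.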

In both of these schemes the backoff distribution is given by the lower order estimate, such as a 2-gram serving as backoff for a 3-gram. Kneser and Ney proposed an alternative backoff distribution which performs better [23]:

$$p_{back}(w \mid h) = \frac{C(\cdot, \hat{h}, w)}{C(\cdot, \hat{h}, \cdot)} \qquad (11)$$

where

$$C(\cdot, \hat{h}, w) = \sum_{g:\, \hat{g} = \hat{h},\; C(g,w) > 0} 1$$

Here $\hat{h}$ denotes a coarser history, which is typically a 2-gram history if $h$ denotes a 3-gram history. $C(\cdot, \hat{h}, w)$ corresponds to the number of different histories $g$ sharing the coarser history $\hat{h}$ after which the word $w$ has been observed, ignoring the frequency of these events.
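The type-count construction can be sketched as below. This is an illustration only: it assumes that the coarser history is obtained by dropping the oldest word of the finer history, and that the denominator of equation 11 is the sum of the type counts over all words; both the function name and the `coarsen` parameter are hypothetical.

```python
from collections import defaultdict

def kn_backoff(joint_counts, coarsen=lambda g: g[1:]):
    """Backoff distribution built from type counts rather than frequencies.

    joint_counts: dict (g, w) -> C(g, w) for the finer histories g
    coarsen:      maps a finer history to its coarser history
    Returns a dict (coarser history, w) -> backoff probability.
    """
    types = defaultdict(int)    # number of distinct g with C(g, w) > 0
    totals = defaultdict(int)   # same count summed over all words w
    for (g, w), c in joint_counts.items():
        if c > 0:
            types[(coarsen(g), w)] += 1
            totals[coarsen(g)] += 1
    return {(h, w): n / totals[h] for (h, w), n in types.items()}
```

A word that follows many different finer histories receives a large backoff probability even if each of those events is rare, while a frequent word seen after a single history does not.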

In summary, there are at least four possible methods of smoothing available. We can decide whether or not shadowing is used and whether or not the Kneser-Ney (KN) backoff distributions are used. Table 2 summarizes the results obtained on the Switchboard data with various model orders and different class models. The baseline corresponds to the use of shadowing together with the use of lower order estimates as backoff distributions.

^12 Ney et al. use the term non-linear interpolation [22].

                 Baseline / +KN backoff distributions / +non-shadowing
C50    2g, 3g    125 127 109 117
C300   2g, 3g    102 100 102 99 89 87 91 82
C1600  2g, 3g    98 96 98 94 85 84 85 78
Word   2g, 3g    97 95 97 94 84 83 84 77

Table 2: Comparison of held-out perplexities for various backoff schemes

5.2  Two-dimensional  backoff 

The lattice structure (see figure 1) suggests an extension to the original backoff idea. If for example we consider a 3-gram predictor at the word level, there are two adjacent predictors which can be used as backoff distributions, namely a 2-gram at the word level and another 3-gram at the next class level.

Considering two backoff distributions was already proposed in a different context [24], where a speaker specific language model was combined with a non-specific language model. In that case, however, a hierarchy between the backoff distributions was defined a priori. By contrast, in the model proposed here both backoff distributions are combined by linear interpolation. Equations 12 and 13 formalize this idea in the case of shadowing and non-shadowing respectively.


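$$p(w \mid h) = \begin{cases} \dfrac{C(h,w) - d_c}{C(h)} & \text{if } C(h,w) > 0 \\[2ex] \alpha(h, \lambda) \left[ \lambda_1\, p_{back_1}(w \mid h) + \lambda_2\, p_{back_2}(w \mid h) \right] & \text{otherwise} \end{cases} \qquad (12)$$

$$p(w \mid h) = \begin{cases} \dfrac{C(h,w) - d_c}{C(h)} + \gamma(h) \left[ \lambda_1\, p_{back_1}(w \mid h) + \lambda_2\, p_{back_2}(w \mid h) \right] & \text{if } C(h,w) > 0 \\[2ex] \gamma(h) \left[ \lambda_1\, p_{back_1}(w \mid h) + \lambda_2\, p_{back_2}(w \mid h) \right] & \text{otherwise} \end{cases} \qquad (13)$$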
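A minimal sketch of the shadowing variant with two backoff distributions follows. It assumes absolute discounting with a constant $d$, so that the freed mass is $d$ times the number of seen words divided by $C(h)$, and it normalizes by the interpolated backoff mass of the unseen words; the function name and the dictionary representation of the two backoff distributions are hypothetical.

```python
def two_dim_shadowing(w, joint, c_h, p_back1, p_back2, lam1, d=0.5):
    """Shadowing backoff to a linear interpolation of two backoff distributions.

    p_back1 could be the lower order predictor at the same class level and
    p_back2 the same order predictor at the next class level.
    """
    lam2 = 1.0 - lam1
    if joint.get(w, 0) > 0:
        return (joint[w] - d) / c_h
    freed = d * len(joint) / c_h                  # mass removed from seen events
    s1 = 1.0 - sum(p_back1[v] for v in joint)     # unseen mass under p_back1
    s2 = 1.0 - sum(p_back2[v] for v in joint)     # unseen mass under p_back2
    alpha = freed / (lam1 * s1 + lam2 * s2)       # depends on the weights
    return alpha * (lam1 * p_back1[w] + lam2 * p_back2[w])
```

Because the normalization factor depends on the interpolation weights, changing the weights rescales every unseen word, which is why the weights cannot be estimated with the usual interpolation formula alone.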


5.3  EM  estimation  of  the  backoff  weights 

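We show how to estimate the interpolation weights in the case of shadowing from some representative set of new data.

Let $\lambda$ denote the pair $(\lambda_1, \lambda_2)$ with the constraint $\sum_i \lambda_i = 1$. Notice that $\lambda$ appears in the normalization factor $\alpha(h, \lambda)$ of equation 12. We can write

$$\alpha(h, \lambda) = \frac{K(h)}{\lambda_1 S_1(h) + \lambda_2 S_2(h)} \qquad (14)$$

where

$$K(h) = \sum_{w:C(h,w)>0} \frac{d_c}{C(h)},$$

$$S_1(h) = \sum_{w:C(h,w)=0} p_{back_1}(w \mid h) = 1 - \sum_{w:C(h,w)>0} p_{back_1}(w \mid h),$$

$$S_2(h) = \sum_{w:C(h,w)=0} p_{back_2}(w \mid h) = 1 - \sum_{w:C(h,w)>0} p_{back_2}(w \mid h).$$

Let $p_\lambda(w \mid h)$ denote the interpolated backoff distribution

$$p_\lambda(w \mid h) = \lambda_1\, p_{back_1}(w \mid h) + \lambda_2\, p_{back_2}(w \mid h)$$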

The re-estimation formula can be derived from an auxiliary function $Q$ representing the difference in the conditional expectation of the complete data log-likelihoods given the observed data. The hidden data is the actual sequence of backoff distributions used while predicting the new data. The function $Q$ is defined as follows:

$$Q(\lambda' \mid \lambda) = \sum_{B} \tilde{p}(h,w) \log \frac{\alpha(h, \lambda')\, p_{\lambda'}(w \mid h)}{\alpha(h, \lambda)\, p_{\lambda}(w \mid h)} \qquad (15)$$

where $B$ denotes the set of backoff events on the new data and $\tilde{p}(h,w)$ denotes the relative frequency of the event $(h,w)$ on this set. We can rewrite equation 15 as follows


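$$\begin{aligned}
Q(\lambda' \mid \lambda) &= \sum_{B} \tilde{p}(h,w) \log \frac{p_{\lambda'}(w \mid h)}{p_{\lambda}(w \mid h)} \;-\; \sum_{B} \tilde{p}(h,w) \log \frac{\lambda_1' S_1(h) + \lambda_2' S_2(h)}{\lambda_1 S_1(h) + \lambda_2 S_2(h)} \\
&\ge \sum_{B} \tilde{p}(h,w) \left[ \frac{\lambda_1\, p_{back_1}(w \mid h)}{p_{\lambda}(w \mid h)} \log \frac{\lambda_1'}{\lambda_1} + \frac{\lambda_2\, p_{back_2}(w \mid h)}{p_{\lambda}(w \mid h)} \log \frac{\lambda_2'}{\lambda_2} \right] \;-\; \sum_{B} \tilde{p}(h,w) \log \frac{\lambda_1' S_1(h) + \lambda_2' S_2(h)}{\lambda_1 S_1(h) + \lambda_2 S_2(h)} \\
&\ge \sum_{B} \tilde{p}(h,w) \left[ \frac{\lambda_1\, p_{back_1}(w \mid h)}{p_{\lambda}(w \mid h)} \log \lambda_1' + \frac{\lambda_2\, p_{back_2}(w \mid h)}{p_{\lambda}(w \mid h)} \log \lambda_2' + C \right] \;+\; \sum_{B} \tilde{p}(h,w) \left[ 1 - \frac{\lambda_1' S_1(h) + \lambda_2' S_2(h)}{\lambda_1 S_1(h) + \lambda_2 S_2(h)} \right]
\end{aligned}$$

where the first inequality is an application of Jensen's inequality^13, $C$ does not depend on $\lambda'$ and the second inequality results from the fact that $-\log x \ge 1 - x$.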

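Computing the partial derivative of $Q$ with respect to $\lambda_1'$

$$\frac{\partial Q}{\partial \lambda_1'} = \sum_{B} \tilde{p}(h,w)\, \frac{\lambda_1\, p_{back_1}(w \mid h)}{p_{\lambda}(w \mid h)}\, \frac{1}{\lambda_1'} \;-\; \sum_{B} \tilde{p}(h,w)\, \frac{S_1(h)}{\lambda_1 S_1(h) + \lambda_2 S_2(h)} \qquad (16)$$

and setting $\frac{\partial Q}{\partial \lambda_1'} = 0$, we get the re-estimation formula

$$\lambda_1' = \frac{\displaystyle \sum_{B} \tilde{p}(h,w)\, \frac{\lambda_1\, p_{back_1}(w \mid h)}{p_{\lambda}(w \mid h)}}{\displaystyle \sum_{B} \tilde{p}(h,w)\, \frac{S_1(h)}{\lambda_1 S_1(h) + \lambda_2 S_2(h)}} \qquad (17)$$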
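One iteration of this re-estimation can be sketched as follows. The event tuples and the function name are hypothetical, the same formula is applied symmetrically to the second weight, and the final renormalization so that the two weights sum to one is an assumption of this sketch rather than a step stated in the derivation.

```python
def reestimate(events, lam1):
    """One re-estimation step for the interpolation weights with shadowing.

    events: iterable of (freq, pb1, pb2, s1, s2), one tuple per backoff
            event (h, w) in the new data, where freq is its relative
            frequency, pb1 and pb2 are the two backoff probabilities of w,
            and s1 and s2 are the unseen masses S_1(h) and S_2(h).
    """
    lam2 = 1.0 - lam1
    num1 = num2 = den1 = den2 = 0.0
    for freq, pb1, pb2, s1, s2 in events:
        p_lam = lam1 * pb1 + lam2 * pb2     # interpolated backoff probability
        norm = lam1 * s1 + lam2 * s2        # denominator of alpha(h, lambda)
        num1 += freq * lam1 * pb1 / p_lam
        num2 += freq * lam2 * pb2 / p_lam
        den1 += freq * s1 / norm
        den2 += freq * s2 / norm
    new1, new2 = num1 / den1, num2 / den2
    total = new1 + new2                     # renormalization (assumed here)
    return new1 / total, new2 / total
```

When both backoff distributions agree on every event and have the same unseen mass, the weights are left unchanged, which is the expected fixed-point behaviour.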


The numerator on the right hand side of equation 17 is analogous to
that found in the classical re-estimation formula to compute interpolation
weights between various language models. The difference here is that the
sum is over the set B, which is the set of events in which a backoff occurred
while predicting the new data. The main difference in this model lies in the
denominator, as the normalization factor $\alpha(\lambda,h)$ is a function of $\lambda$.
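
The re-estimation step of equation 17 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the event layout (relative frequency, the two backoff probabilities, and the two sums S_1(h), S_2(h)) is hypothetical, the toy numbers are made up, and the log-likelihood helper assumes that the normalization factor is proportional to 1 / (lambda_1 S_1(h) + lambda_2 S_2(h)).

```python
import math

def p_lambda(lam, pb):
    """Interpolated backoff distribution: lam1*p_back1 + lam2*p_back2."""
    return lam[0] * pb[0] + lam[1] * pb[1]

def reestimate(events, lam):
    """One re-estimation step following equation 17.

    events: list of (freq, (p_back1, p_back2), (S1, S2)) tuples, one per
            backoff event (h, w); this layout is an assumption of the sketch.
    """
    num = [0.0, 0.0]
    den = [0.0, 0.0]
    for freq, pb, s in events:
        mix = p_lambda(lam, pb)
        norm = lam[0] * s[0] + lam[1] * s[1]
        for i in range(2):
            num[i] += freq * lam[i] * pb[i] / mix   # numerator of eq. 17
            den[i] += freq * s[i] / norm            # denominator of eq. 17
    return [num[i] / den[i] for i in range(2)]

def backoff_log_likelihood(events, lam):
    """Log-likelihood of the backoff events, up to terms independent of lam.

    Assumes alpha(lam, h) is proportional to 1 / (lam1*S1(h) + lam2*S2(h)).
    """
    return sum(
        freq * math.log(p_lambda(lam, pb) / (lam[0] * s[0] + lam[1] * s[1]))
        for freq, pb, s in events
    )

# Toy backoff events (made-up numbers, for illustration only).
events = [
    (0.5, (0.02, 0.05), (0.4, 0.6)),
    (0.3, (0.10, 0.01), (0.7, 0.5)),
    (0.2, (0.03, 0.03), (0.5, 0.5)),
]

lam = [0.5, 0.5]
history = [backoff_log_likelihood(events, lam)]
for _ in range(20):
    lam = reestimate(events, lam)
    history.append(backoff_log_likelihood(events, lam))
```

Under these assumptions the objective is unchanged when both weights are multiplied by the same constant, so the weights can be rescaled to sum to one after any step without affecting the likelihood.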

In  the  case  of  non-shadowing  the  derivation  is  somewhat  simpler  and  the 
re-estimation  formula  is  given  by  equation  18 


Jensen's inequality: if f is a convex function and X a random variable,
then f[E(X)] ≤ E[f(X)].

Model                            PP
Word 3g                          84
Lattice                          82
Lattice + linear interpolation   79
Linear interpolation             79

Table 3: Test set perplexity with shadowing


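$$\lambda'_i = \frac{1}{G} \left[ \sum_{\bar{B}} p(w|h) \frac{\gamma(h) \lambda_i p_{back_i}(w|h)}{p_0(w|h) + \gamma(h) \left( \lambda_1 p_{back_1}(w|h) + \lambda_2 p_{back_2}(w|h) \right)} + \sum_{B} p(w|h) \frac{\lambda_i p_{back_i}(w|h)}{\lambda_1 p_{back_1}(w|h) + \lambda_2 p_{back_2}(w|h)} \right] \qquad (18)$$

where G is a normalization constant that satisfies the constraint
$\lambda'_1 + \lambda'_2 = 1$, B is the set of events (h,w) on the new data
such that C(h,w) = 0 on the training data, $\bar{B}$ is the complement of B,
and $p_0(w|h)$ =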

Generalization of the proposed approach to more than two backoff
distributions is trivial and will not be detailed here. Moreover, this model
is not restricted to lattices of N-grams but can be applied to other cases
where several backoff distributions are relevant. Finally, one should stress
that backoff models are usually applied recursively to several predictors.
The proposed re-estimation should then be applied first with less specific
predictors and then iterated with more specific ones. There is no guarantee,
however, that such an iterated re-estimation will globally maximize the
likelihood of the most specific model.


5.4  Preliminary  results 

We describe in this section some preliminary results obtained on the
Switchboard data with lattices of N-gram models smoothed with two-dimensional
backoff. In these experiments we restricted the lattices to 3-gram models
with 4 class levels. Figure 9 shows the interpolation weights (plain arrows)
estimated on held-out data. For example, from a 3-gram at the word level
the backoff weight of the word 2-gram is 0.58 while it would be 1.0 in a
traditional one-dimensional backoff scheme. A conventional interpolation of
the 4 higher order predictors can be performed on top of the two-dimensional
backoff. The so-called global interpolation weights are represented by dashed
arrows in figure 9.

Table 3 presents the test set perplexity obtained with these models in the
case of shadowing. The first line corresponds to the reference model, that is
a word 3-gram model. The second line gives the test set perplexity of the
model obtained after estimation of the backoff weights on held-out data. The
third line shows the additional improvement which can be obtained with the
global weights. This last result represents a 6 % perplexity reduction over the
reference model. However, the same reduced perplexity can be obtained with
global interpolation only (in this case the backoff weights are fixed to 0.0
in the vertical direction, and 1.0 otherwise). A similar conclusion can be
drawn from table 4, in which non-shadowing is used and the bigram predictors
are replaced by their corresponding KN distributions. As pointed out in
section 5.3 there is no guarantee that the iterated re-estimation globally
maximizes the likelihood of the most specific model. We observe here a
practical case where lattice based language models do not outperform linearly
interpolated models.

This result is somewhat surprising, as the two-dimensional backoff model
contains more free parameters. Additional experiments should be performed
to confirm the source of this result. In particular, larger lattices could be
used and two-dimensional backoff could be generalized to more than two
predictors. Another interesting extension would rely on the definition of
backoff weights depending on the history h. In such a case, the weights could
be optimized for each history instead of globally with only a few additional
free parameters (6 in the previous example).
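
The relative reductions implied by the perplexities of Table 3 can be checked with a short computation. The sketch below is only a worked arithmetic example; the helper name and the dictionary layout are assumptions made for illustration.

```python
def relative_reduction(baseline_pp, new_pp):
    """Relative perplexity reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_pp - new_pp) / baseline_pp

# Test set perplexities as reported in Table 3 (shadowing).
table3 = {
    "Word 3g": 84,
    "Lattice": 82,
    "Lattice + linear interpolation": 79,
    "Linear interpolation": 79,
}

reductions = {
    name: relative_reduction(table3["Word 3g"], pp) for name, pp in table3.items()
}
```

Going from 84 to 79 is a reduction of 5/84, which rounds to the 6 % quoted in the text, while the lattice alone (84 to 82) accounts for a little over 2 %.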


6  Dynamic  combination  of  predictors 

6.1  Predictor  combination 

Another way of looking at the lattice presented in figure 1 is to consider
each node as the starting point for a recursive backoff scheme that is limited
to progressing in one dimension, the horizontal one. Figure 10 presents
the perplexity obtained using this scheme on the held-out set from each
lattice node. In this case, KN distributions are used as backoff distributions
combined with non-shadowing. The reference word 3-gram model has a
perplexity of 77 which is only slightly outperformed by the word 4-gram and
5-gram models. The dynamic combination of predictors can then be
perceived as a way to combine these 17 predictors in order to improve over the
reference model.


Figure 10: Held-out perplexity with KN distributions. Each node represents
a predictor which starts at that node and backs off horizontally.
(Horizontal axis, left to right: 5-gram, 4-gram, 3-gram, 2-gram, 1-gram.)

6.2 Predicting the Oracle decisions

Figure 11: Use of predictor variables to choose the optimal combination.
(Inputs: H(h), C(h), Class, Order; output: P(w|h).)

Suppose we had an oracle that knew which of the 17 predictors should be
used in any given context to predict the next word. A very well informed
oracle could look at the word to be predicted and pick the best predictor
accordingly. The perplexity on the held-out set would then drop to 38.
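
The oracle computation described here can be sketched as follows. This is a minimal illustration on made-up numbers, assuming that each predictor's probability for the word actually observed is available at every position; the function names and the data layout are hypothetical.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence of per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def oracle_perplexity(predictions):
    """predictions[t][k] is the probability that predictor k assigns to the
    word actually observed at position t; the oracle keeps the best one."""
    return perplexity([max(row) for row in predictions])

# Toy data: 3 positions, 2 predictors (numbers invented for illustration).
predictions = [
    [0.10, 0.20],
    [0.50, 0.05],
    [0.01, 0.30],
]

single = [perplexity([row[k] for row in predictions]) for k in range(2)]
oracle = oracle_perplexity(predictions)
```

Because such an oracle looks at the word being predicted, the selected values do not form a normalized distribution; the resulting figure is best read as a bound on what predictor selection could achieve rather than as the perplexity of a usable model.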

A more realistic framework is to combine the predictors linearly but to
adapt dynamically the interpolation weights between the predictors. To do
this, we can rely on four predictor variables: the entropy H(h), the count
C(h), the number of classes and the model order (see figure 11).

Using these predictor variables, a decision tree can be built from the
held-out set to predict the probability that any predictor would outperform
the reference model by some factor (typically set to 1.5). Such a decision
tree is presented in figure 12 where left branches correspond to "yes"
answers. For example the probability that any randomly selected predictor
(except the word 3-gram itself) would significantly outperform the reference
model is 0.18 in general. This probability drops to 0.01 when the number of
classes is below 950 and the count of the history is above 126,204.

The relative weight of any predictor can be made proportional to the
probability attached to the leaf into which it falls. Notice that even though
the decision tree is fixed, the weight applied to a predictor changes
dynamically, as H(h) and C(h) depend on the observed history.
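
The weighting scheme just described can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tree used in the experiments: the thresholds are borrowed from figure 12, but every leaf value other than 0.01 is invented, and the function names and the feature layout (entropy, count, number of classes, order) are hypothetical.

```python
def leaf_probability(entropy, count, n_classes, order):
    """Probability that a predictor beats the reference model by the chosen
    factor, read off a small hand-written tree. Thresholds follow figure 12;
    leaf values other than 0.01 are invented for this sketch."""
    if n_classes < 950:
        if count < 126204:
            return 0.25 if entropy < 5.98 else 0.15
        return 0.01
    if order < 2.5:
        return 0.22 if entropy < 5.18 else 0.12
    return 0.10

def dynamic_weights(features):
    """Weights proportional to the leaf probability of each predictor."""
    scores = [leaf_probability(*f) for f in features]
    total = sum(scores)
    return [s / total for s in scores]

def combine(probs, features):
    """Linear combination of predictor probabilities with dynamic weights."""
    weights = dynamic_weights(features)
    return sum(w * p for w, p in zip(weights, probs))

# Three hypothetical predictors for one history h:
# (entropy H(h), count C(h), number of classes, model order).
features = [
    (5.0, 1000, 500, 3),
    (6.5, 200000, 500, 2),
    (4.0, 50, 2000, 2),
]
probs = [0.1, 0.2, 0.3]   # P(w|h) under each predictor (made-up values)
weights = dynamic_weights(features)
combined = combine(probs, features)
```

The tree itself never changes, but the features are recomputed for every history, so the same predictor can receive a large weight in one context and an almost negligible one in another.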

Table 5 presents test set perplexities that result from experiments based
on this approach (the decision tree used in practice was more detailed than
the one shown in figure 12). The perplexity of the best reference model (word
3-gram with non-shadowing and KN distributions) is 77. A static interpolation
with globally optimized but fixed weights yields a perplexity of 74. The
dynamic combination of the lattice predictors gives a perplexity of 72.

Prob[ P(w|h) > P_3g(w|h) * 1.5 ]

Prob = 0.18, CLASS < 950 ?
  yes: Prob = 0.21, C(h) < 126204 ?
    yes: Prob = 0.21, H(h) < 5.98 ?
    no:  Prob = 0.01
  no:  Prob = 0.16, Order < 2.5 ?
    yes: Prob = 0.19, H(h) < 5.18 ?
    no:  Prob = 0.11, Order < 3.5 ?

Figure 12: Decision tree

Model                      PP
Word 3g                    77
Lattice + Interpolation    74
Lattice + Decision Trees   72

Table 5: Test set perplexities


7 Summary and Conclusions

In this report we introduced lattice based language models, an alternative
language modeling paradigm which we have just started exploring. Lattice
based models construct multi-dimensional hierarchies of partitions and then
select the most promising partitions (nodes) to generate the estimated
distributions.

We  discussed  a  specific  two  dimensional  lattice,  where  the  first  dimension 
is  the  length  of  the  history  equivalence  class,  and  the  second  dimension  is  the 
position  in  a  word  class  hierarchy.  As  originally  defined  such  a  lattice  is  in 
fact  a  DAG,  since  subsumption  relations  exist  among  neighboring  vertices. 
Simple  set  subtraction  operations  can  remove  these  data  dependencies. 

Next, we considered which features of the lattice nodes are indicative of
their usefulness, and proposed the use of two primary ones: the training-set
history count of the node, and the (smoothed) entropy of its prediction.
Using the SWITCHBOARD corpus, we constructed a two dimensional, 17
node lattice, and calculated history and prediction hit ratios for all its nodes
using held out data. We then demonstrated how the prediction hit ratio
depends strongly on both the count of the history and its entropy, thus
justifying our original choice.

After discussing various smoothing techniques, we proposed a
straightforward generalization of the conventional backoff strategy to
multiple dimensions, and derived the formula for calculating the optimal
interpolation weights, using the Expectation-Maximization (EM) algorithm.
This simple model provided a modest improvement over the baseline trigram.
Another interpolation scheme, using the same predictor set, achieved the same
performance.

The true strength of lattice models, we believe, lies in dynamic selection
of a small subset of predictor nodes. How to select such a set is an open and
interesting research problem, which we have just begun to look at. Oracle
experiments suggest that significant improvements are possible if we choose
the predictor set correctly. Indeed, an initial attempt at using a decision
tree to make that selection yielded some improvement. We believe much more
improvement is possible, and are hoping to explore this problem in greater
detail in the future.